EDA
Now that we have presented the variables contained in the dataset, let’s try to understand the data structure, characteristics and underlying patterns thanks to an EDA.
Dataset overview
Columns description
Let’s have a quick look at the characteristics of the columns. You will find more statistical details about it in the annexe.
Code
# Get the number of rows and columns
num_rows <- nrow(data)
num_cols <- ncol(data)
# Get the frequency of column types
col_types <- sapply(data, class)
type_freq <- table(col_types)
char_count <- ifelse("character" %in% names(type_freq), type_freq["character"], 0)
num_count <- ifelse("numeric" %in% names(type_freq), type_freq["numeric"], 0)
# Create a summary data frame
summary_df <- data.frame(
Name = "data",
Number_of_rows = num_rows,
Number_of_columns = num_cols,
Character = char_count,
Numeric = num_count,
Group_variables = "None"
)
# Display the summary using kable
kable(summary_df, format = "html", table.attr = "style='width:50%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data | 45896 | 26 | 8 | 18 | None |
The dataset that we are working with contains approx. 46’000 rows and 26 columns, each row representing a model from one of the 141 brands. From the data overview, we can see that most of our features are concerning the consumption of the cars. If we now check more in details in the annex, we notice that some variables contain a lot of missing and that the variable “Time.to.Charge.EV..hours.at.120v.” is only containing 0s. We will handle these issues in the section “Data cleaning”.
Exploration of the distribution
Now let’s explore the distribution of the numerical features.
Code
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data)# Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0 

As the majority of models in our dataset are neither electric vehicles (EVs) nor hybrid cars and because of the nature of some column concerning only these two types of vehicles, the results are showing numerous zero values in several columns. This issue will be addressed during the data cleaning process. Additionally, certain features, such as “Engine Cylinders,” are numerically discrete, as illustrated in the corresponding plot.
Outliers Detection
In order identify occurences that deviate significantly for the rest of the observations, and in order to potentially improve the global quality of the data, we have decided to analyse outliers thanks to boxplots. Here are the result on the numerical columns of the dataset:
Code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("ID",
"model_year",
"estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
"highway_mpg_fuel_type_1",
"combined_MPG_Fuel_Type_1",
"City_MPG_Fuel_Type_2",
"highway_mpg_fuel_type_2",
"combined_MPG_Fuel_Type_2",
"time_to_Charge_EV_hours_at_120v_",
"charge_time_240v",
"range_for_EV",
"range_ev_city_fuel_type_1",
"range_ev_city_fuel_type_2",
"range_ev_highway_fuel_type_1",
"range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Outlier Detection Boxplots", x = "", y = "Values")
Most of our boxplots are showing extreme values. Again, this is due to the small amount of EV and hybrid cars in our dataset compared to the rest of the models and due to the nature of some features, concerning only those type of vehicles.
Number of models per make
Now let’s check how many models per make we have in our dataset. In order to have a clear plot, we have decided to keep the top 20 brands among all the make on the graph. All the remaining makes are accessible on the table just below.
Code
#Number of occurences/model per make
# nb_model_per_make <- data %>%
# group_by(make, model) %>%
# summarise(Number = n(), .groups = 'drop') %>%
# group_by(make) %>%
# summarise(nb_model_per_make = n(), .groups = 'drop') %>%
# arrange(desc(nb_model_per_make))
nb_model_per_make <- data %>%
group_by(make) %>%
summarise(Number = n(), .groups = 'drop')
#makes with only 1 model
only_one_model_cars <- nb_model_per_make %>%
filter(Number == 1 )
#table globale
datatable(nb_model_per_make,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))Code
# Option to limit to top 20 makes for better readability
# top_n_makes <- nb_model_per_make %>% top_n(20, nb_model_per_make)
top_n_makes <- nb_model_per_make %>%
arrange(desc(Number)) %>% # Sort in descending order by the column 'count'
slice_max(Number, n = 20)
#plot
ggplot(top_n_makes, aes(x = reorder(make, Number), y = Number)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Models per Make (Top 20)",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability
On the 141 brands, we notice that only 13 brands have more than 100 models in the dataset. Among these, only two of them (Mercedes-Benz and BMW) have more than 400 models presents. Let’s now check the number of make with 1 model in the dataset.
Code
#table only 1 model
datatable(only_one_model_cars,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))Code
ggplot(only_one_model_cars, aes(x = make, y = Number)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Makes with only 1 model in the dataset",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability
Therefore, we can see that Mercendes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with some imbalances in categories. Therefore, we need to be careful when doing predictions, as will may encounter bias toward these two majority classes. Therefore, there are few technics that can be used to deal with this problem, such as resampling technics, Ensemble Methods (RF, Boosting), tuning probability threshold,…..
https://chatgpt.com/c/09a66e4e-80c6-4fbd-bf4e-73a2b3e44afd